Filtered circular fingerprints improve either prediction or runtime performance while retaining interpretability
Authors
Abstract
BACKGROUND Although circular fingerprints were first introduced more than 50 years ago, they are still widely used for building highly predictive, state-of-the-art (Q)SAR models. Historically, these structural fragments were designed for searching large molecular databases. Hence, to derive a compact representation, circular fingerprint fragments are often folded into comparatively short bit strings. However, folding introduces bit collisions, which adds noise to the encoded structural information and removes its interpretability. Both representations, folded and unprocessed fingerprints, are commonly used for (Q)SAR modeling.

RESULTS We show that it can be preferable to build (Q)SAR models with circular fingerprint fragments that have been filtered by supervised feature selection, instead of using folded fingerprints or all fragments. Compared to folded fingerprints, filtered fingerprints significantly increase predictive performance and remain unambiguous and interpretable. Compared to unprocessed fingerprints, filtered fingerprints reduce the computational effort and are a more compact and less redundant feature representation. Depending on the selected learning algorithm, filtering yields about equally predictive (Q)SAR models. We demonstrate the suitability of filtered fingerprints for (Q)SAR modeling by presenting our freely available web service Collision-free Filtered Circular Fingerprints, which provides rationales for predictions by highlighting important structural features in the query compound (see http://coffer.informatik.uni-mainz.de).

CONCLUSIONS Circular fingerprints are potent structural features that yield highly predictive models and encode interpretable structural information. However, to preserve interpretability, circular fingerprints should not be folded when building prediction models. Our experiments show that filtering is a suitable option to reduce the high computational effort when working with all fingerprint fragments. Additionally, our experiments suggest that the area under the precision-recall curve is a more sensible statistic for validating (Q)SAR models for virtual screening than the area under the ROC curve or other measures for early recognition.
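The contrast between folding and supervised filtering described in the abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the fragment strings, the toy activity labels, the CRC32-modulo folding scheme, and the use of a chi-square score as the feature-selection criterion are hypothetical stand-ins, not the authors' implementation.

```python
import zlib
from collections import defaultdict

def fold(fragments, n_bits):
    """Fold fragments into n_bits positions by hashing modulo n_bits.

    Returns a dict mapping bit position -> set of fragments that land there.
    A position holding more than one fragment is a bit collision.
    """
    bits = defaultdict(set)
    for frag in fragments:
        bits[zlib.crc32(frag.encode()) % n_bits].add(frag)
    return dict(bits)

def chi_square(present, labels):
    """Chi-square statistic of the 2x2 table (fragment present vs. active)."""
    n = len(labels)
    a = sum(1 for p, y in zip(present, labels) if p and y)
    b = sum(1 for p, y in zip(present, labels) if p and not y)
    c = sum(1 for p, y in zip(present, labels) if not p and y)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def filter_fragments(molecules, labels, k):
    """Keep the k fragments most associated with the label (ties by name)."""
    vocab = sorted({f for m in molecules for f in m})
    scores = {f: chi_square([f in m for m in molecules], labels) for f in vocab}
    return sorted(vocab, key=lambda f: (-scores[f], f))[:k]

# Hypothetical toy data: each molecule is a set of fragment identifiers.
molecules = [
    {"c1ccccc1", "C(=O)O", "CN"},
    {"c1ccccc1", "C(=O)O"},
    {"CN", "CCO"},
    {"CCO", "CCl"},
]
labels = [1, 1, 0, 0]  # 1 = active, 0 = inactive

vocab = sorted({f for m in molecules for f in m})
folded = fold(vocab, n_bits=2)  # 5 fragments in 2 bits: collisions are forced
collisions = {bit: frags for bit, frags in folded.items() if len(frags) > 1}
selected = filter_fragments(molecules, labels, k=3)

print("colliding bits:", collisions)
print("selected fragments:", selected)
```

In the folded representation, a set bit can no longer be traced back to a single fragment, whereas each selected fragment in the filtered representation still corresponds to exactly one substructure. That one-to-one mapping is what allows a prediction to be explained by highlighting fragments in the query compound.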
Similar resources
PREDICTION OF SLOPE STABILITY STATE FOR CIRCULAR FAILURE: A HYBRID SUPPORT VECTOR MACHINE WITH HARMONY SEARCH ALGORITHM
The slope stability analysis is routinely performed by engineers to estimate the stability of river training works, road embankments, embankment dams, excavations and retaining walls. This paper presents a new approach to build a model for the prediction of slope stability state. The support vector machine (SVM) is a new machine learning method based on statistical learning theory, which can so...
Preventing Key Performance Indicators Violations Based on Proactive Runtime Adaptation in Service Oriented Environment
Key Performance Indicator (KPI) is a type of performance measurement that evaluates the success of an organization or a partial activity in which it engages. If during the running process instance the monitoring results show that the KPIs do not reach their target values, then the influential factors should be identified, and the appropriate adaptation strategies should be performed to prevent ...
TFP: Time-Sensitive, Flow-Specific Profiling at Runtime
Program profiling can help performance prediction and compiler optimization. This paper describes the initial work behind TFP, a new profiling strategy that can gather and verify a range of flow-specific information at runtime. While TFP can collect more refined information than block, edge or path profiling, it is only 5.75% slower than a very fast runtime path-profiling technique. Statistics ...
Prediction of Deformation of Circular Plates Subjected to Impulsive Loading Using GMDH-type Neural Network
In this paper, experimental responses of the clamped mild steel, copper, and aluminium circular plates are presented subjected to blast loading. The GMDH-type neural networks (Group Method of Data Handling) are then used for the modelling of the mid-point deflection thickness ratio of the circular plates using those experimental results. The aim of such modelling is to show how the mid-point de...
Predicting the Metabolic Sites by Flavin-Containing Monooxygenase on Drug Molecules Using SVM Classification on Computed Quantum Mechanics and Circular Fingerprints Molecular Descriptors
As an important enzyme in Phase I drug metabolism, the flavin-containing monooxygenase (FMO) also metabolizes some xenobiotics with soft nucleophiles. The site of metabolism (SOM) on a molecule is the site where the metabolic reaction is exerted by an enzyme. Accurate prediction of SOMs on drug molecules will assist the search for drug leads during the optimization process. Here, some quantum m...
Journal title:
Volume 8, issue
Pages -
Publication date: 2016